Introduction

In response to a severe lack of reporting within government sources, The Washington Post compiled a database of every fatal police shooting in the United States from 2015-2022. We are interested in exploring this data, specifically as it relates to differences between U.S. states and regions.

This exploratory data analysis is divided into five main parts: first, we organize the data; second, we perform some basic statistical analyses; third, we reshape the data for state- and region-based comparative analyses; fourth, we pose a SMART research question about our data and attempt to answer it; finally, we use the result of that research question to pose a second SMART question for modeling.

In Part 3 the data is reshaped and new data is added. The reshaped data serves both the first part of this project (Midterm) and the latter part (Final).

To look at the modeling part of this project, please move down to line 1053, where Part 5 starts.

If you would like to only run things with

Part 1: Setting Up the Data

First we call our packages. Then we read the data set that comes from a csv file called FPS22.csv.


After accounting for null values, the data set we are working with has 6,574 observations. Below we have provided a single sample observation:

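| Name | Date | Manner of Death | Armed | Age | Gender | Race | City |
|---|---|---|---|---|---|---|---|
| Tim Elliot | 10/04/2022 | Shot | Gun | 53 | M | A | Shelton |

| State | Signs of Mental Illness | Threat Level | Flee | Body Camera | Longitude | Latitude | Is Geocoding Exact? |
|---|---|---|---|---|---|---|---|
| WA | 1 | TRUE | Not fleeing | FALSE | -123 | 47.2 | TRUE |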

The total number of observations:

## [1] 5720
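The loading and null-handling step described above could look roughly like the following. This is a minimal sketch, not the report's actual code: the column names and the toy rows are assumptions, and a temporary file stands in for FPS22.csv so that the example runs on its own.

```r
# Toy stand-in for FPS22.csv (the real column layout is assumed, not known)
csv_text <- "name,age,gender,state,body_camera
Person A,53,M,WA,FALSE
Person B,,M,OR,TRUE
Person C,34,F,,FALSE
Person D,29,M,CA,TRUE"

path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# Treat empty strings as missing so that blank cells count as nulls
fps <- read.csv(path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Keep only complete rows, then count the remaining observations
fps_clean <- na.omit(fps)
n_obs <- nrow(fps_clean)
print(n_obs)
```

Here `na.omit()` drops every row with at least one missing value, so the count printed afterwards is the number of complete observations.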

Part 2: Basic Statistics

We provide some basic statistics about 2015-2022 fatal police shootings in the United States, using information from the Washington Post data set.

Mean age of victims of police violence:

## [1] 36.7

Median age of victims of police violence:

## [1] 34

Figure 1

Frequency graph for the age of victims of police violence:

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.

Figure 2

Frequency graph for the race of victims of police violence:

## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`

Figure 3

Frequency graph for the gender of victims of police violence:

## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`

Figure 4

Frequency graph for the manner of death of victims of police violence:

## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`

Figure 5

Frequency graph for the threat level of victims of police violence:

## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`

Figure 6

Hover over the map below to see the breakdown of fatal police shootings, divided by the race of the victim. We looked at the total number of deaths in each state by race and following are some of the insights:

  1. The state with the highest number of victims of police violence is California with a total of 885 victims, followed by Texas with 553 and Florida with 427.

  2. This ordering is consistent with the populations of these states: California is the most populous, followed by Texas and then Florida.

  3. We also observe that the highest number of deaths is for Hispanic people in California, whereas in Texas and Florida there are more fatal shootings of White people.

## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.

Figure 7

Now we look at the age of the suspect shot, as well as their race. We made the following observations:

  1. The boxplot below shows that the median age of Black people who have been killed by police is 29 years.

  2. White people have a higher median age of 35 years, and Asian people have the highest median age, at around 38 years.

Figure 8

If we look at the age of each victim against the status of their mental health, we can make the following observation: signs of mental illness appear more frequently among victims in their 30s, while among victims aged 50 and above, death by police is more common for people showing signs of mental illness.

Part 3: Reshaping the Data for State and Regional Comparative Analysis

After completing the above exploratory analysis, we decided to do some comparative analyses between states and regions in order to form a specific, measurable, achievable, relevant, and time-oriented research question to pursue for the remainder of the project.

To do this, we began by dividing the data into regions for easier visualization and comparative analysis. The regions group the U.S. states as follows:

| Northwest (NW) | Southwest (SW) | Midwest (MW) | Southeast (SE) | Northeast (NE) |
|---|---|---|---|---|
| California | New Mexico | Illinois | Georgia | New York |
| Washington | Arizona | Wisconsin | Alabama | Rhode Island |
| Oregon | Texas | Indiana | Mississippi | Maryland |
| Nevada | Oklahoma | Michigan | Louisiana | Vermont |
| Idaho | Hawaii | Minnesota | Tennessee | Pennsylvania |
| Utah | - | Missouri | North Carolina | Maine |
| Montana | - | Iowa | South Carolina | New Hampshire |
| Colorado | - | Kansas | Florida | New Jersey |
| Wyoming | - | North Dakota | Arkansas | Connecticut |
| Arkansas | - | South Dakota | West Virginia | Massachusetts |
| Arkansas | - | Nebraska | DC | - |
| - | - | Ohio | Virginia | - |

Fatal shootings in the Northwest United States:

## [1] 1551

Fatal shootings in the Southwest United States:

## [1] 1058

Fatal shootings in the Midwest United States:

## [1] 955

Fatal shootings in the Southeast United States:

## [1] 1668

Fatal shootings in the Northeast United States:

## [1] 488
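The regional counts above come from assigning each state to a region and tallying the shootings. A minimal sketch of that idea follows; the lookup vector `region_of` and the toy `state` vector are hypothetical and cover only a handful of states.

```r
# Hypothetical lookup from state abbreviation to region (only a few states shown)
region_of <- c(CA = "NW", WA = "NW", TX = "SW", AZ = "SW",
               IL = "MW", OH = "MW", FL = "SE", GA = "SE",
               NY = "NE", PA = "NE")

# Toy vector of the state recorded for each shooting
state <- c("CA", "TX", "CA", "FL", "NY", "OH", "GA", "WA", "TX", "FL")

# Map each observation to its region and count shootings per region
regions <- factor(region_of[state], levels = c("MW", "NE", "NW", "SE", "SW"))
region_counts <- table(regions)
print(region_counts)
```

Indexing a named vector by the state codes keeps the mapping in one place, which makes it easier to spot a state that has been assigned to the wrong region or left out.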

We also create some new columns for the modeling portion (Part 5).

The first new block of data is state-level police spending per capita for the year 2021.

Next, we add a binary variable indicating whether a state has a law mandating that police officers use their body cameras when interacting with members of the public.

We then add data on the direction each state swung in the 2020 election (negative = swing to Trump, positive = swing to Biden).

Finally, we add a variable for the number of police officers per 100K citizens by state.

Part 5 includes some basic EDA of these new variables.

We then created two sub-data sets by grouping the data by state and by region for visualization purposes. The contents of both groups are identical, besides their grouping.

Part 4: Research SMART Question and Answer

Within our data set of 6,574 observations of police shootings from 2015 to 2022 in the United States, is there a correlation between the U.S. state of observation and whether a body camera was turned on during the shooting?

First let’s take a look at our data after it has been grouped by state and reorganized into the following variables:

| Variable | Meaning |
|---|---|
| state | State of observation |
| regions | Region of observation |
| stbcp | Body camera on proportion by state |
| gen.p | Proportion of male victims by state |
| smi.p | Proportion of victims by state with signs of mental illness |
| flee.p | Proportion of victims by state that were fleeing |
| att.p | Proportion of victims by state that were attacking |
| armed.p | Proportion of victims by state that were armed |
| MoD.p | Proportion of victims by state that were shot |
| age.avg | Average age by state |
| Non_White_prop | Proportion of non-White victims by state |
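A state-level proportion such as stbcp can be derived from incident-level rows by averaging a logical column within each state. The sketch below uses made-up rows. The summaries in this report list 5720 rows, which suggests the state-level values are attached back to each incident, so the final `merge()` mirrors that as an assumption.

```r
# Toy incident-level data; TRUE means the body camera was on
incidents <- data.frame(
  state = c("WA", "WA", "WA", "WA", "OR", "OR", "CA", "CA", "CA", "CA"),
  body_camera = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,
                  TRUE, TRUE, FALSE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values
by_state <- aggregate(body_camera ~ state, data = incidents, FUN = mean)
names(by_state)[2] <- "stbcp"
print(by_state)

# Attach the state-level proportion to every incident row (assumed layout)
incidents <- merge(incidents, by_state, by = "state")
```

The other proportion columns in the table could be built the same way from their own logical indicators.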

The state data subgroup can be summarized as follows:

##     state              month               year           regions  
##  Length:5720        Length:5720        Length:5720        MW: 955  
##  Class :character   Class :character   Class :character   NE: 488  
##  Mode  :character   Mode  :character   Mode  :character   NW:1551  
##                                                           SE:1668  
##                                                           SW:1058  
##                                                                    
##     spendpc         bclaw          marg2020      le_per_100k      stbcp      
##  Min.   : 390   Min.   :0.000   Min.   :-43.0   Min.   :284   Min.   :0.000  
##  1st Qu.: 526   1st Qu.:0.000   1st Qu.:-12.0   1st Qu.:379   1st Qu.:0.106  
##  Median : 608   Median :0.000   Median :  0.2   Median :439   Median :0.132  
##  Mean   : 650   Mean   :0.017   Mean   :  1.5   Mean   :441   Mean   :0.143  
##  3rd Qu.: 704   3rd Qu.:0.000   3rd Qu.: 16.0   3rd Qu.:479   3rd Qu.:0.180  
##  Max.   :1337   Max.   :1.000   Max.   : 87.0   Max.   :722   Max.   :0.429  
##      gen.p           smi.p           flee.p      att.p          armed.p     
##  Min.   :0.800   Min.   :0.000   Min.   :0   Min.   :0.400   Min.   :0.818  
##  1st Qu.:0.937   1st Qu.:0.197   1st Qu.:0   1st Qu.:0.577   1st Qu.:0.912  
##  Median :0.946   Median :0.232   Median :0   Median :0.643   Median :0.929  
##  Mean   :0.951   Mean   :0.230   Mean   :0   Mean   :0.639   Mean   :0.931  
##  3rd Qu.:0.965   3rd Qu.:0.268   3rd Qu.:0   3rd Qu.:0.684   3rd Qu.:0.955  
##  Max.   :1.000   Max.   :0.714   Max.   :0   Max.   :1.000   Max.   :1.000  
##      MoD.p          age.avg     Non_White_prop 
##  Min.   :0.800   Min.   :31.7   Min.   :0.000  
##  1st Qu.:0.934   1st Qu.:35.4   1st Qu.:0.370  
##  Median :0.944   Median :36.6   Median :0.509  
##  Mean   :0.949   Mean   :36.7   Mean   :0.490  
##  3rd Qu.:0.968   3rd Qu.:38.2   3rd Qu.:0.587  
##  Max.   :1.000   Max.   :47.0   Max.   :0.931

The region data subgroup can be summarized as follows:

##     state              month               year              spendpc    
##  Length:5720        Length:5720        Length:5720        Min.   : 390  
##  Class :character   Class :character   Class :character   1st Qu.: 526  
##  Mode  :character   Mode  :character   Mode  :character   Median : 608  
##                                                           Mean   : 650  
##                                                           3rd Qu.: 704  
##                                                           Max.   :1337  
##      bclaw          marg2020      le_per_100k      stbcp           gen.p      
##  Min.   :0.000   Min.   :-43.0   Min.   :284   Min.   :0.000   Min.   :0.800  
##  1st Qu.:0.000   1st Qu.:-12.0   1st Qu.:379   1st Qu.:0.106   1st Qu.:0.937  
##  Median :0.000   Median :  0.2   Median :439   Median :0.132   Median :0.946  
##  Mean   :0.017   Mean   :  1.5   Mean   :441   Mean   :0.143   Mean   :0.951  
##  3rd Qu.:0.000   3rd Qu.: 16.0   3rd Qu.:479   3rd Qu.:0.180   3rd Qu.:0.965  
##  Max.   :1.000   Max.   : 87.0   Max.   :722   Max.   :0.429   Max.   :1.000  
##      smi.p           flee.p      att.p          armed.p          MoD.p      
##  Min.   :0.000   Min.   :0   Min.   :0.400   Min.   :0.818   Min.   :0.800  
##  1st Qu.:0.197   1st Qu.:0   1st Qu.:0.577   1st Qu.:0.912   1st Qu.:0.934  
##  Median :0.232   Median :0   Median :0.643   Median :0.929   Median :0.944  
##  Mean   :0.230   Mean   :0   Mean   :0.639   Mean   :0.931   Mean   :0.949  
##  3rd Qu.:0.268   3rd Qu.:0   3rd Qu.:0.684   3rd Qu.:0.955   3rd Qu.:0.968  
##  Max.   :0.714   Max.   :0   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##     age.avg     Non_White_prop 
##  Min.   :31.7   Min.   :0.000  
##  1st Qu.:35.4   1st Qu.:0.370  
##  Median :36.6   Median :0.509  
##  Mean   :36.7   Mean   :0.490  
##  3rd Qu.:38.2   3rd Qu.:0.587  
##  Max.   :47.0   Max.   :0.931

We will now check our data for normality:

Because the plot is relatively linear, we can conclude this data is close enough to normality for our purpose.
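A normality check of this kind is commonly done with a normal Q-Q plot. The sketch below assumes the variable being checked is stbcp and uses made-up values; the correlation at the end is only a rough numeric companion to the visual judgement, not a formal test.

```r
# Toy state-level proportions standing in for stbcp
stbcp <- c(0.00, 0.08, 0.10, 0.11, 0.12, 0.13, 0.13, 0.14, 0.15,
           0.16, 0.18, 0.18, 0.20, 0.25, 0.43)

# Draw to a null device so the sketch also runs without a display
pdf(NULL)
qq <- qqnorm(stbcp, main = "Normal Q-Q plot of stbcp")
qqline(stbcp)
invisible(dev.off())

# Correlation between theoretical and sample quantiles; values near 1
# correspond to a roughly straight Q-Q line
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so that the plot appears on screen.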

Now let us look at the body camera proportions by state. In the below bar graph, TRUE signifies a police body camera that was on, while FALSE indicates the body camera was off:

Number of fatal shootings where the body camera was on:

##   body_camera   n
## 1        TRUE 905

Number of fatal shootings where the body camera was off:

##   body_camera    n
## 1       FALSE 5383

This scatter plot shows the proportion of fatal shootings in which cameras were on, by state (the variable stbcp). Each point on the graph depicts a state’s proportion of shootings where the police body camera was turned on during the incident. We can see that there is very little variation in the Southwest and considerable variation among states in the Midwest.

Finally, let us check out the mean body camera on proportion for all states:

## [1] 0.143

And the median body camera on proportion (stbcp) for all states:

## [1] 0.132

We will now perform a chi-square test to see whether the proportions differ significantly between states.

Null: There is no significant difference between U.S. states in the proportion of body cameras being turned on during police shootings.

Alternative: There is a significant difference between U.S. states in the proportion of body cameras being turned on during police shootings.

Significance Level: alpha = 0.05

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.106   0.132   0.143   0.180   0.429
## 
##  Pearson's Chi-squared test
## 
## data:  contable
## X-squared = 3e+05, df = 2250, p-value <2e-16

With a p-value below 2e-16, far under our significance level of alpha = 0.05, we reject the null hypothesis: there are significant differences between states’ proportions of body camera usage during fatal police shootings.
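For illustration, a chi-square test of independence between state and body camera status can be run on a table of counts with one row per state and one column per camera outcome. The counts and state labels below are invented, and this layout is an assumption rather than a description of how `contable` was built in the report.

```r
# Toy counts of shootings by state and body camera status
contable <- matrix(
  c(40, 160,   # state A: camera on, camera off
    15, 185,   # state B
    60, 140),  # state C
  nrow = 3, byrow = TRUE,
  dimnames = list(state = c("A", "B", "C"),
                  body_camera = c("on", "off"))
)

# Test of independence between state and body camera status
result <- chisq.test(contable)
print(result)

# Degrees of freedom are (rows - 1) * (columns - 1)
df <- unname(result$parameter)
print(df)
```

With this layout, a table of 51 rows (50 states plus DC) and two camera outcomes would have 50 degrees of freedom.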

This exploratory data analysis has shown that there are significant differences in the level of body camera usage in police shootings between states and regions in the United States. We intend to examine why these differences exist and which factors may explain them. This will require understanding state laws and policies regarding the use of police body cameras, as well as the consequences that police forces in different states impose for turning off body cameras during police activity.

The use of body cameras in police work is an important topic for data-driven policy research in the United States. We hope to apply this relationship between the U.S. state of observation and whether the body camera was on or off during the shooting to state policy on body cameras during police work.

Part 5: Modeling

Because of our findings in Part 4, we know there are significant differences in the level of body camera usage in police shootings between US states, but let us see if we can find out what drives those differences.

Our second SMART question:

For the years 2021 and 2022, do the following variables have any influence on a state’s proportion of body camera usage?

- U.S. region
- Law enforcement officers per capita
- Law enforcement spending per capita
- Body camera mandate laws
- 2020 presidential election leaning

We will use multiple linear regression to build various models to see if any of these variables can be useful predictors.

Note: Because most states that have body camera laws had them take effect at the start of 2021, we will only be looking at data from 2021 and 2022. This reduces the number of cases in our original data set to 1763. (This is below the 4000-observation threshold, but was approved in class by Professor Faruque.)

Part 5.1: Introduction to the New Data

First let us take a look at the new dataset with its new variables (added in Part 3):

## # A tibble: 6 × 17
## # Groups:   state [6]
##   state month year  regions spendpc bclaw marg2020 le_per_1…¹  stbcp gen.p smi.p
##   <chr> <chr> <chr> <fct>     <dbl> <dbl>    <dbl>      <dbl>  <dbl> <dbl> <dbl>
## 1 WA    10    2022  NW          608     0       19       320. 0.112  0.958 0.336
## 2 OR    10    2022  NW          736     0       16       284. 0.0833 0.979 0.302
## 3 KS    10    2022  MW          553     0      -15       467. 0.133  0.917 0.217
## 4 CA    10    2022  NW          981     0       29       378. 0.184  0.946 0.232
## 5 CO    10    2022  NW          664     0       14       417. 0.124  0.965 0.139
## 6 OK    10    2022  SW          487     0      -33       410. 0.180  0.976 0.216
## # … with 6 more variables: flee.p <dbl>, att.p <dbl>, armed.p <dbl>,
## #   MoD.p <dbl>, age.avg <dbl>, Non_White_prop <dbl>, and abbreviated variable
## #   name ¹​le_per_100k
## [1] 1763

Now for some light EDA to look at the new data:

Histogram on 2020 Election Swing Margin

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram on police spending per capita (for the year 2021)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram on Law Enforcement Officers per 100K citizens

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The following 3D plot compares the following variables: bclaw,stbcp,marg2020

Here are some multivariate plots to visualize how the new data relates:

Law Enforcement Officers Per 100K citizens vs 2020 Election Margin

State Body Camera Proportion vs Law Enforcement Officers Per 100K citizens

State Body Camera Proportion vs 2020 Election Margin

State Body Camera Proportion vs US Regions

## Warning: Using size for a discrete variable is not advised.

US Regions vs Law Enforcement Officers Per 100K citizens

## Warning: Using size for a discrete variable is not advised.

The following 3D plot compares the following variables: body camera proportion, 2020 election margin, and the state’s region

And finally, here is a 3D plot of law enforcement officers per 100K citizens, police spending, and a state’s region

Part 5.2: The Models

Now that we are familiar with the data, we can start to model with our new state-wide data.

This is Model 1: a simple multiple linear regression (MLR) model that uses all the new variables along with the region variable:

## 
## Call:
## lm(formula = stbcp ~ (marg2020 + bclaw + regions + le_per_100k + 
##     spendpc), data = FD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14481 -0.02162 -0.00773  0.00831  0.29707 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.09e-01   1.13e-02    9.68  < 2e-16 ***
## marg2020     1.77e-04   1.22e-04    1.45  0.14625    
## bclaw        6.37e-02   1.06e-02    6.00  2.4e-09 ***
## regionsNE   -3.52e-02   6.83e-03   -5.16  2.8e-07 ***
## regionsNW   -2.00e-02   5.85e-03   -3.42  0.00064 ***
## regionsSE   -2.86e-02   4.74e-03   -6.03  2.0e-09 ***
## regionsSW   -4.44e-03   4.97e-03   -0.89  0.37116    
## le_per_100k -7.15e-05   2.70e-05   -2.64  0.00825 ** 
## spendpc      1.27e-04   1.48e-05    8.58  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0579 on 1754 degrees of freedom
## Multiple R-squared:  0.183,  Adjusted R-squared:  0.179 
## F-statistic: 49.1 on 8 and 1754 DF,  p-value: <2e-16
##             GVIF Df GVIF^(1/(2*Df))
## marg2020    2.99  1            1.73
## bclaw       1.09  1            1.04
## regions     5.43  4            1.24
## le_per_100k 2.19  1            1.48
## spendpc     3.97  1            1.99
##        res
## 1 -0.03482
## 2 -0.08165
## 3 -0.00990
## 4 -0.00773
## 5 -0.02162
## 6  0.04839
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The VIF values for Model 1 are all within an acceptable range. With an R^2 of 0.183, the model explains only about 18% of the variance in statewide body camera usage, so it is not a strong predictor. We next remove the region variable to see how a simpler model performs.
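To make the R^2 and VIF checks concrete, here is a minimal base-R sketch on synthetic data. The column names follow the report, but every value is made up. The VIF is computed by hand as 1 / (1 - R^2_j) for numeric predictors only; the GVIF table shown above for the factor `regions` matches the output format of `vif()` from the car package, which this sketch does not reproduce.

```r
set.seed(1)
n <- 200

# Synthetic predictors named after the report's columns (values are made up)
toy <- data.frame(
  marg2020    = rnorm(n, 0, 20),
  bclaw       = rbinom(n, 1, 0.1),
  le_per_100k = rnorm(n, 440, 80),
  spendpc     = rnorm(n, 650, 150)
)
toy$stbcp <- 0.1 + 0.05 * toy$bclaw + 1e-4 * toy$spendpc + rnorm(n, 0, 0.05)

fit <- lm(stbcp ~ marg2020 + bclaw + le_per_100k + spendpc, data = toy)
r2 <- summary(fit)$r.squared

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
predictors <- c("marg2020", "bclaw", "le_per_100k", "spendpc")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = toy)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif, 2))
print(round(r2, 3))
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others; common rules of thumb treat values above 5 or 10 as a sign of problematic multicollinearity.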

Model 2 uses only the most helpful predictors from the previous model.

## 
## Call:
## lm(formula = stbcp ~ (marg2020 + bclaw + spendpc + le_per_100k), 
##     data = FD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14719 -0.02411 -0.00778  0.01628  0.27593 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.15e-01   1.10e-02   10.38  < 2e-16 ***
## marg2020     1.16e-04   1.12e-04    1.04      0.3    
## bclaw        4.81e-02   1.05e-02    4.56  5.4e-06 ***
## spendpc      1.16e-04   1.18e-05    9.81  < 2e-16 ***
## le_per_100k -1.06e-04   1.88e-05   -5.65  1.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0589 on 1758 degrees of freedom
## Multiple R-squared:  0.152,  Adjusted R-squared:  0.15 
## F-statistic: 78.5 on 4 and 1758 DF,  p-value: <2e-16
##    marg2020       bclaw     spendpc le_per_100k 
##        2.42        1.03        2.45        1.02
##        res
## 1 -0.04151
## 2 -0.08839
## 3  0.00578
## 4 -0.00778
## 5 -0.02468
## 6  0.05588
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The VIF values for model 2 are all within acceptable range. With an R^2 of 0.152, this model is even worse at predicting statewide body camera usage.

Let us try analyzing the interaction of law enforcement spending per capita and officers per capita.

Model 3 is just Model 1 again with the aforementioned interaction.

## 
## Call:
## lm(formula = stbcp ~ (marg2020 + bclaw + regions + le_per_100k + 
##     spendpc + I(spendpc * le_per_100k)), data = FD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13141 -0.01977  0.00129  0.00907  0.27827 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5.19e-01   2.93e-02   17.72  < 2e-16 ***
## marg2020                  6.30e-04   1.19e-04    5.30  1.3e-07 ***
## bclaw                     4.97e-02   1.00e-02    4.96  7.8e-07 ***
## regionsNE                -4.50e-02   6.46e-03   -6.96  4.8e-12 ***
## regionsNW                -2.33e-03   5.63e-03   -0.41  0.67844    
## regionsSE                -1.63e-02   4.53e-03   -3.59  0.00034 ***
## regionsSW                 9.48e-03   4.77e-03    1.99  0.04690 *  
## le_per_100k              -9.59e-04   6.43e-05  -14.90  < 2e-16 ***
## spendpc                  -4.55e-04   4.12e-05  -11.05  < 2e-16 ***
## I(spendpc * le_per_100k)  1.22e-06   8.13e-08   15.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0545 on 1753 degrees of freedom
## Multiple R-squared:  0.276,  Adjusted R-squared:  0.272 
## F-statistic: 74.3 on 9 and 1753 DF,  p-value: <2e-16
##        res
## 1 -0.07073
## 2 -0.09132
## 3  0.00790
## 4  0.00515
## 5 -0.03706
## 6  0.04312
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We skip the VIF test for multicollinearity because the model includes an interaction predictor. With an R^2 of 0.276, this model is still not good, but it is much better than the others at predicting statewide body camera usage.
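The effect of adding a product term can be illustrated by comparing nested models. The sketch below uses synthetic data in which the effect of spending is built to depend on staffing; the coefficients that generate the data are made up and only loosely echo the magnitudes in the Model 3 output.

```r
set.seed(2)
n <- 300
toy <- data.frame(
  le_per_100k = runif(n, 280, 720),
  spendpc     = runif(n, 390, 1340)
)
# Made-up response in which the effect of spending depends on staffing
toy$stbcp <- 0.5 - 9e-4 * toy$le_per_100k - 4e-4 * toy$spendpc +
  1.2e-6 * toy$spendpc * toy$le_per_100k + rnorm(n, 0, 0.02)

additive     <- lm(stbcp ~ le_per_100k + spendpc, data = toy)
with_product <- lm(stbcp ~ le_per_100k + spendpc + I(spendpc * le_per_100k),
                   data = toy)

# Nested-model F test: does the product term improve the fit?
print(anova(additive, with_product))
r2_add <- summary(additive)$r.squared
r2_int <- summary(with_product)$r.squared
print(c(additive = r2_add, with_product = r2_int))
```

With an interaction, the slope for one predictor depends on the other. Using the Model 3 estimates, the slope for spendpc is -4.55e-04 + 1.22e-06 × le_per_100k, which is negative below roughly 373 officers per 100K and positive above it.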

Part 5.3: Predicting New Data

Since lm3 is our best model (per our R^2), let’s try to predict a few made-up new U.S. states:

Please notice that Eleum and Faraam are identical except for their body camera laws, and the same is true of GW and HW.

Now let’s plug these new “states” into our model:

##     fit   lwr   upr
## 1 0.204 0.183 0.226
##     fit   lwr   upr
## 1 0.199 0.186 0.213
##     fit   lwr   upr
## 1 0.212 0.189 0.234
##     fit   lwr   upr
## 1 0.271 0.243 0.299
##     fit   lwr   upr
## 1 0.127 0.117 0.137
##     fit   lwr   upr
## 1 0.177 0.155 0.198
##     fit   lwr   upr
## 1 0.129 0.121 0.138
##     fit   lwr   upr
## 1 0.179 0.159 0.199

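Comparing the fits for states E and F, as well as for G and H, shows the effect of body camera laws: within each of the last two pairs of outputs above, the fit differs by about 0.05 (0.127 vs. 0.177 and 0.129 vs. 0.179), in line with the bclaw coefficient of 4.97e-02 in Model 3.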
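Predictions with lower and upper bounds like those above can be produced with `predict()` on a fitted `lm` object. The sketch below uses synthetic data and two hypothetical states that differ only in `bclaw`. Whether the report used confidence or prediction intervals is not stated, so `interval = "confidence"` is an assumption.

```r
set.seed(3)
n <- 200
toy <- data.frame(
  bclaw   = rbinom(n, 1, 0.2),
  spendpc = runif(n, 390, 1340)
)
toy$stbcp <- 0.08 + 0.05 * toy$bclaw + 1e-4 * toy$spendpc + rnorm(n, 0, 0.03)
fit <- lm(stbcp ~ bclaw + spendpc, data = toy)

# Two hypothetical states that differ only in whether a mandate exists
new_states <- data.frame(bclaw = c(0, 1), spendpc = c(600, 600))

# interval = "confidence" is an assumption; "prediction" gives wider bounds
pred <- predict(fit, newdata = new_states, interval = "confidence")
print(round(pred, 3))

# With no interaction on bclaw, the gap in fit equals the bclaw coefficient
gap <- pred[2, "fit"] - pred[1, "fit"]
print(all.equal(unname(gap), unname(coef(fit)["bclaw"])))
```

Because `bclaw` enters the model only as a main effect, two otherwise identical states always differ in fitted value by exactly its coefficient, which is why paired hypothetical states isolate the estimated effect of the law.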

Part 5.4: Conclusions

Though lm3 is our best model, it still is not a great predictor of statewide body camera usage, which can lead us to the following conclusions: